7 Data Manipulation

7.1 Learning Outcomes

By the end of this section, you should:

be confident in cleaning datasets, preparing them for analysis
understand how to manipulate variables, and create new variables, in preparation for analysis

7.2 Introduction

A lot of the data you’ll deal with in sport is ‘messy’. That is, it may not land on your screen in a format that is immediately ready for analysis.

Therefore, some of the key steps we need to take include:

making sure we can read and write the data files;
making sure the variables are named consistently and accurately;
dealing with missing data;
dealing with outliers;
making sure each variable type is defined correctly;
we have all the variables that we need to conduct our analysis.

The following sections describe how a number of these steps can be achieved in R. From my experience, this stage is likely to be the most time-consuming (and frustrating) part of any data analysis, so it’s worth making sure you fully understand the commands and processes outlined below.

7.3 The ‘tidyverse’ package

tidyverse is a collection of R packages designed for data manipulation, exploration, and visualisation. Make sure you have it installed.

As noted previously, there are lots of different ways to do the same thing in R. We’re going to stick to using the tidyverse library.

Two of the core packages within tidyverse, readr and dplyr, provide really useful functions for reading and writing data in various formats.

You encountered one part of the tidyverse package, tibbles, in the previous reading (‘Data Structures in R’). However, there are a lot of other functions within this package that make it well worth learning.

7.4 Reading and Writing Data using `tidyverse`

Reading Data - `readr`

readr is a package within Tidyverse that provides functions to read data from common file formats, including CSV, TSV, and fixed-width files. It’s designed for fast and efficient data reading with user-friendly parsing and type conversion.

To read data from a CSV file, we can use the read_csv() function:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

url <- 'https://www.dropbox.com/scl/fi/6pctzyujt0mgvxvxbtmvw/ah_data_02.csv?rlkey=fqbc76dlqzp22ie6qm3urd3jv&dl=1'

# Import a CSV file

csv_data <- read_csv(url)

Rows: 26 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Team
dbl (9): Pos, Pl, W, D, L, F, A, GD, Pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(csv_data)

# A tibble: 6 × 10
    Pos Team                 Pl     W     D     L     F     A    GD   Pts
  <dbl> <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 Arsenal              30    23     4     3    72    29    43    73
2     2 Manchester City      29    21     4     4    75    27    48    67
3     3 Newcastle United     29    15    11     3    48    21    27    56
4     4 Manchester United    29    17     5    42    44    37     7    56
5     5 Tottenham Hotspur    30    16     5     9    55    42    13    53
6     6 Aston Villa          60    14     5    11    41    40     1    47

For other file formats, we can use the corresponding tidyverse read functions, such as read_tsv() for TSV files or read_fwf() for fixed-width files.

Writing Data - `readr`

readr also provides functions for writing data to common file formats, like CSV and TSV. For example, the write_csv() function can be used to write a data frame to a CSV file:

# Write data to a CSV file
write_csv(csv_data, "output.csv")

Similarly, we can use the write_tsv() function to write data to a TSV file.

7.5 Data cleaning and transformation

dplyr is another core package within tidyverse that provides a set of tools for data manipulation, such as filtering, selecting, and summarising data.

The following procedures are some of the most common ones you’ll want to use when preparing your data prior to analysis.

Remember to keep using your environment window to check what’s being produced by these code snippets.

Filtering rows with a specific value, and selecting specific columns

# Load the required packages
library(tidyverse)

# create a sample tibble
data <- tibble(
 id = 1:5,
 category = c("A", "B", "A", "B", "A"),
 value = c(23, 45, 12, 78, 37)

)

# Filter rows with category 'A' and select columns 'id' and 'value'

filtered_data <- data %>%
 filter(category == "A") %>%
 select(id, value)
print(filtered_data)

# A tibble: 3 × 2
     id value
  <int> <dbl>
1     1    23
2     3    12
3     5    37

Selecting specific columns

# Select columns by name

selected_data <- data %>%
 select(id, category, value)

Sorting your data by column values

# Sort data in ascending order

sorted_data <- data %>%
 arrange(value)

# Sort data in descending order

sorted_data <- data %>%
 arrange(desc(value))

Creating new columns

# Add a new column with calculated values

new_data <- data %>%
 mutate(new_column = id * value)

Grouping data, and performing aggregations

# Group data by a column and calculate the mean of another column

grouped_data <- data %>%
 group_by(category) %>%
 summarize(mean_value = mean(value))

Combining reading, writing, and manipulating data

You can chain together reading, manipulating, and writing data using the %\>% pipe operator from tidyverse.

This allows for a more readable and efficient workflow.

In the following example, we read data from a CSV file, filter it, and then write it back to a new CSV file.

# Read data from a CSV file, filter, and write to a new CSV file

read_csv(url) %>%
 filter(Pl == "30") %>%
 select(Team, W) %>%
 write_csv("filtered_data.csv")

Rows: 26 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Team
dbl (9): Pos, Pl, W, D, L, F, A, GD, Pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

When you run this code, can you interpret what is happening? Check your files and you should see a new CSV file called ‘filtered_data.csv’.

7.6 Data reshaping with `tidyr`

tidyr is another core tidyverse package that provides functions for cleaning and reshaping data. It helps create “tidy” data, where each variable is a column, and each observation is a row.

7.7 Gather multiple columns into key-value pairs (wide to long format)

# Gather columns 'column1', 'column2', and 'column3' into key-value pairs

long_data <- data %>%
gather(key = "variable", value = "value", id, category, value)

7.8 Spread key-value pairs into separate columns (long to wide format)

# Spread key-value pairs in 'id' and 'value' columns into separate columns

wide_data <- data %>%
spread(key = "id", value = "value")

7.9 Separate a single column into multiple columns

# Separate 'column_name' into two new columns 'column1' and 'column2', splitting by a delimiter (e.g., '-')

separated_data <- data %>%
 separate(value, into = c("column1", "column2"), sep = "-")

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 2, 3,
4, 5].

Unite multiple columns into a single column

# Unite columns into a new column 'column_name', with delimiter (e.g., '-')

united_data <- data %>%
 unite(column_name, id, value, sep = "-")

7.10 Alternative Approaches

In the preceding sections, we used the tidyverse package to conduct various operations on a newly-imported dataset.

Before tidyverse was introduced, it was possible to conduct the same operations, using the following code.

Many coders still use these commands, and it is important to be familiar with them even if you don’t use them yourself.

Reading a .csv file

rm(list=ls()) # this code cleans my environment

# load library
library(dplyr)

# import data

url <- 'https://www.dropbox.com/scl/fi/6pctzyujt0mgvxvxbtmvw/ah_data_02.csv?rlkey=fqbc76dlqzp22ie6qm3urd3jv&dl=1'

data <- read.csv(url)
head(data) # display the first six rows

  Pos              Team Pl  W  D  L  F  A GD Pts
1   1           Arsenal 30 23  4  3 72 29 43  73
2   2   Manchester City 29 21  4  4 75 27 48  67
3   3  Newcastle United 29 15 11  3 48 21 27  56
4   4 Manchester United 29 17  5 42 44 37  7  56
5   5 Tottenham Hotspur 30 16  5  9 55 42 13  53
6   6       Aston Villa 60 14  5 11 41 40  1  47

str(data)  # inspect variable types

'data.frame':   26 obs. of  10 variables:
 $ Pos : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Team: chr  "Arsenal" "Manchester City" "Newcastle United" "Manchester United" ...
 $ Pl  : int  30 29 29 29 30 60 28 29 30 29 ...
 $ W   : int  23 21 15 17 16 14 13 12 10 11 ...
 $ D   : int  4 4 11 5 5 5 7 8 999 6 ...
 $ L   : int  3 4 3 42 9 11 8 9 7 12 ...
 $ F   : int  72 75 48 44 55 41 52 50 47 39 ...
 $ A   : int  29 27 21 37 42 40 NA 35 40 40 ...
 $ GD  : int  43 48 27 7 13 1 16 15 7 -1 ...
 $ Pts : int  73 67 56 56 53 47 46 44 43 39 ...

Remove outliers (‘999’)

# Check each row if 999 is present and, if so, replace with 'NA'

data[data == 999, ]

      Pos Team Pl  W  D  L  F  A GD Pts
NA     NA <NA> NA NA NA NA NA NA NA  NA
NA.1   NA <NA> NA NA NA NA NA NA NA  NA
NA.2   NA <NA> NA NA NA NA NA NA NA  NA
NA.3   NA <NA> NA NA NA NA NA NA NA  NA
NA.4   NA <NA> NA NA NA NA NA NA NA  NA
NA.5   NA <NA> NA NA NA NA NA NA NA  NA
NA.6   NA <NA> NA NA NA NA NA NA NA  NA
NA.7   NA <NA> NA NA NA NA NA NA NA  NA
NA.8   NA <NA> NA NA NA NA NA NA NA  NA
NA.9   NA <NA> NA NA NA NA NA NA NA  NA
NA.10  NA <NA> NA NA NA NA NA NA NA  NA
NA.11  NA <NA> NA NA NA NA NA NA NA  NA
NA.12  NA <NA> NA NA NA NA NA NA NA  NA
NA.13  NA <NA> NA NA NA NA NA NA NA  NA
NA.14  NA <NA> NA NA NA NA NA NA NA  NA
NA.15  NA <NA> NA NA NA NA NA NA NA  NA
NA.16  NA <NA> NA NA NA NA NA NA NA  NA
NA.17  NA <NA> NA NA NA NA NA NA NA  NA
NA.18  NA <NA> NA NA NA NA NA NA NA  NA
NA.19  NA <NA> NA NA NA NA NA NA NA  NA
NA.20  NA <NA> NA NA NA NA NA NA NA  NA
NA.21  NA <NA> NA NA NA NA NA NA NA  NA
NA.22  NA <NA> NA NA NA NA NA NA NA  NA
NA.23  NA <NA> NA NA NA NA NA NA NA  NA
NA.24  NA <NA> NA NA NA NA NA NA NA  NA
NA.25  NA <NA> NA NA NA NA NA NA NA  NA
NA.26  NA <NA> NA NA NA NA NA NA NA  NA
NA.27  NA <NA> NA NA NA NA NA NA NA  NA
NA.28  NA <NA> NA NA NA NA NA NA NA  NA
NA.29  NA <NA> NA NA NA NA NA NA NA  NA
NA.30  NA <NA> NA NA NA NA NA NA NA  NA
NA.31  NA <NA> NA NA NA NA NA NA NA  NA
NA.32  NA <NA> NA NA NA NA NA NA NA  NA
NA.33  NA <NA> NA NA NA NA NA NA NA  NA
NA.34  NA <NA> NA NA NA NA NA NA NA  NA
NA.35  NA <NA> NA NA NA NA NA NA NA  NA
NA.36  NA <NA> NA NA NA NA NA NA NA  NA
NA.37  NA <NA> NA NA NA NA NA NA NA  NA
NA.38  NA <NA> NA NA NA NA NA NA NA  NA
NA.39  NA <NA> NA NA NA NA NA NA NA  NA
NA.40  NA <NA> NA NA NA NA NA NA NA  NA
NA.41  NA <NA> NA NA NA NA NA NA NA  NA
NA.42  NA <NA> NA NA NA NA NA NA NA  NA
NA.43  NA <NA> NA NA NA NA NA NA NA  NA
NA.44  NA <NA> NA NA NA NA NA NA NA  NA
NA.45  NA <NA> NA NA NA NA NA NA NA  NA
NA.46  NA <NA> NA NA NA NA NA NA NA  NA
NA.47  NA <NA> NA NA NA NA NA NA NA  NA
NA.48  NA <NA> NA NA NA NA NA NA NA  NA
NA.49  NA <NA> NA NA NA NA NA NA NA  NA
NA.50  NA <NA> NA NA NA NA NA NA NA  NA
NA.51  NA <NA> NA NA NA NA NA NA NA  NA
NA.52  NA <NA> NA NA NA NA NA NA NA  NA
NA.53  NA <NA> NA NA NA NA NA NA NA  NA
NA.54  NA <NA> NA NA NA NA NA NA NA  NA
NA.55  NA <NA> NA NA NA NA NA NA NA  NA
NA.56  NA <NA> NA NA NA NA NA NA NA  NA
NA.57  NA <NA> NA NA NA NA NA NA NA  NA
NA.58  NA <NA> NA NA NA NA NA NA NA  NA

Removing rows with missing data

data01 <- na.omit(data) # removes any row (observation) with missing data

Remove a variable ‘A’ from dataset ‘data’

data02<- subset(data01, select = -c(A))

Rename variables in dataset ‘data’ and create a new dataset ‘data02’

data03 <-rename(data02, league_position = Pos, games_played = Pl) # create the new dataset

rm(data, data01, data02) # remove original dataset from memory

Change variable types

data03$Team = as.factor(data03$Team) # make team a factor

Create a new logical true/false variable based on existing variables

We can create a TRUE/FALSE variable (logical) based on how many points a team has accrued.

# first, we replace the existing values with TRUE or FALSE

data03$over_50[data03$Pts >=50] <- TRUE
data03$over_50[data03$Pts <50] <- FALSE

# then, we change the variable type to logical
data03$over_50 = as.logical(data03$over_50)

# finally, we check out dataset and make sure the variable type has changed
head(data03) # show first six rows

  league_position              Team games_played  W  D  L  F GD Pts over_50
1               1           Arsenal           30 23  4  3 72 43  73    TRUE
2               2   Manchester City           29 21  4  4 75 48  67    TRUE
3               3  Newcastle United           29 15 11  3 48 27  56    TRUE
4               4 Manchester United           29 17  5 42 44  7  56    TRUE
5               5 Tottenham Hotspur           30 16  5  9 55 13  53    TRUE
6               6       Aston Villa           60 14  5 11 41  1  47   FALSE

str(data03) # inspect variable types

'data.frame':   16 obs. of  10 variables:
 $ league_position: int  1 2 3 4 5 6 8 9 10 12 ...
 $ Team           : Factor w/ 16 levels "Arsenal","Aston Villa",..: 1 10 12 11 15 2 9 3 6 4 ...
 $ games_played   : int  30 29 29 29 30 60 29 30 29 30 ...
 $ W              : int  23 21 15 17 16 14 12 10 11 8 ...
 $ D              : int  4 4 11 5 5 5 8 999 6 9 ...
 $ L              : int  3 4 3 42 9 11 9 7 12 13 ...
 $ F              : int  72 75 48 44 55 41 50 47 39 29 ...
 $ GD             : int  43 48 27 7 13 1 15 7 -1 -11 ...
 $ Pts            : int  73 67 56 56 53 47 44 43 39 33 ...
 $ over_50        : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...